The San Francisco Police Department (SFPD) Stop Data was designed to capture information to comply with the Racial and Identity Profiling Act (RIPA), or California Assembly Bill (AB) 953. SFPD officers collect specific information on each stop, including elements of the stop, the circumstances, and the perceived identity characteristics of the individual(s) stopped. The information obtained by officers is reported to the California Department of Justice. This dataset includes data on stops starting on July 1st, 2018, which is when the data collection program went into effect.
This dataset includes information about police stops that occurred, including some details about the person(s) stopped, and what happened during the stop. Each row is a person stopped with a record identifier for the stop and a unique identifier for the person. A single stop may involve multiple people and may produce more than one associated unique identifier for the same record identifier.
library(tidyverse)   # loads dplyr, ggplot2, tidyr, stringr, lubridate
library(plotly)      # interactive plots; masks stats::filter() and graphics::layout()
library(hrbrthemes)  # requires the Arial Narrow or Roboto Condensed fonts
library(viridis)     # colour scales (loads viridisLite)
library(MASS)        # masks dplyr::select(), hence the dplyr:: prefixes below
library(reshape2)
library(reshape)     # masks dplyr::rename() and reshape2::colsplit()/melt()/recast()
library(ggmap)       # cite with citation("ggmap"); see Google/Stadia/OSM terms of service
This project investigates police stop data together with traffic accident data on injury collisions in San Francisco.
One dataset (the Police Stop data) covers 2018-2023, while the other (the Accident data) covers 2005-2023; to work over the same period, both were restricted to 2018-2023.
The Police Stop data has 244,934 observations of 87 variables, and the Accident data has 57,456 observations of 57 variables.
First, the structure of the Police Stop data was inspected: only the numeric variables were stored as numeric, and everything else, including the date variable, was stored as character. The date variable was converted to a proper date with a function from the lubridate package, and the monthly trend in the number of stops was checked with an interactive plotly line plot.
There was a dramatic decrease in February 2020, probably because the pandemic reduced the number of stops as officers limited interaction to stay safe from Covid; perhaps officers simply issued penalty tickets without stopping people.
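The character-to-date conversion mentioned above can be sketched like this. This is a minimal toy example: the date format shown is an assumption, since the actual format in the export may differ.

```r
library(lubridate)

# Toy data standing in for the raw Police Stop export (hypothetical format).
stop <- data.frame(date = c("2018/07/01", "2020/02/15"), count = c(3, 1))

# ymd() parses year-month-day strings regardless of separator.
stop$date <- ymd(stop$date)
```

After this step `stop$date` has class `Date`, so `lubridate::floor_date()` can aggregate it by month.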
stop_monthly <- stop |>
  group_by(date = lubridate::floor_date(date, 'month')) |>
  summarize(count = sum(count))
#str(stop_monthly)
fig <- plot_ly(stop_monthly, x = ~date, y = ~count,
               type = 'scatter', mode = 'lines') |>
  layout(title = 'Trend of number of people stopped by police 2018-2023',
         plot_bgcolor = "#e5ecf6",
         xaxis = list(title = 'Date'),
         yaxis = list(title = 'Count of stop'))
fig
Many stop records were collected in this dataset, but some were later marked as deleted; only records with the status “Completed - Successful Submission” are included in this project.
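A minimal sketch of that filter, assuming the status lives in a column named `record_status` (a hypothetical name; the real column name in the export may differ):

```r
library(dplyr)

# Toy records: one kept, one deleted (hypothetical `record_status` column).
stop <- data.frame(
  record_status = c("Completed - Successful Submission", "Deleted"),
  count = c(5, 2)
)

# Keep only successfully submitted records.
stop <- dplyr::filter(stop, record_status == "Completed - Successful Submission")
```

The explicit `dplyr::` prefix avoids the `filter()` masking introduced by the other loaded packages.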
stop_age <- stop |>
  group_by(gender = as.factor(perceived_gender), age = perceived_age) |>
  summarize(count = sum(count))
ggplot(stop_age, aes(x = age, y = count, colour = gender)) +
  geom_line()
Next, some demographic information is checked. First, gender and age are analysed with a line plot. Most of the people stopped are male, aged roughly 25 to 50; the next most frequent group is female in the same age range, and there is a very small group of transgender people. The plot also shows outliers in age: the driving-licence age in California is 16.5, yet the data starts at age 1, and although few people should still be driving after 80, there are values close to 125. The data is therefore filtered to ages between 16 and 80.
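The age filter described above can be sketched as follows (a toy example; `perceived_age` is assumed to be numeric at this point):

```r
library(dplyr)

# Toy ages, including the implausible extremes seen in the plot.
stop <- data.frame(perceived_age = c(1, 17, 45, 124))

# Keep only plausible driving ages.
stop <- dplyr::filter(stop, perceived_age >= 16, perceived_age <= 80)
```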
summary(factor_stop$perceived_race_ethnicity)
Asian Black/African American
27041 59846
Hispanic/Latino(a) Middle Eastern or South Asian
47937 16152
Multi-racial Native American
5989 356
Pacific Islander White
2959 84654
summary(factor_stop$perceived_age_group)
18 - 29 30 - 39 40 - 49 50 - 59 60 or over Under 18
64295 78355 51320 33524 16673 767
After checking the age and gender values, the perceived ethnicity of the person stopped is investigated. In the USA, many African American people accuse police officers of treating them as suspects because of their race. African Americans make up only about 5% of San Francisco's population, yet despite this small population share they are stopped very frequently. White people are the most stopped group in every age group, which makes sense since almost half of the population is white. Hispanic people, about 15% of the population, also have a higher stop rate than the other ethnicities.
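The population comparison can be made explicit by turning the counts from the summary above into shares and dividing by (approximate) population shares; the population figures used here are the ones cited in the text, not exact census values.

```r
# Counts taken from summary(factor_stop$perceived_race_ethnicity) above.
stops <- c(`Black/African American` = 59846,
           White = 84654,
           `Hispanic/Latino(a)` = 47937)

# Approximate San Francisco population shares, as cited in the text.
pop_share <- c(`Black/African American` = 0.05,
               White = 0.45,
               `Hispanic/Latino(a)` = 0.15)

stop_share <- stops / 244934      # total observations in the stop data
round(stop_share / pop_share, 1)  # ratio > 1 means over-represented in stops
</imports>
```

A ratio well above 1 for African Americans and near or below 1 for the other groups would quantify the disparity the text describes.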
Accident Data vs Police Stop Data Trend
To check whether there is any relationship between the stop rate and the crash rate, Traffic Accident data collected from the San Francisco government open data library was used, with its date values corrected in the same way as in the Police Stop data.
The line plot below shows that the trend in the number of crashes does not change, even during the pandemic; the series is stationary.
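The monthly crash aggregation behind this comparison can be sketched as below. This assumes the accident data frame is named `crash` and has an already-parsed `date` column (toy data shown here).

```r
library(dplyr)
library(lubridate)
library(plotly)

# Toy crash records: one row per collision.
crash <- data.frame(date = as.Date(c("2019-01-03", "2019-01-20", "2019-02-10")))

# Aggregate to one count per month, mirroring the stop series.
crash_monthly <- crash |>
  group_by(date = floor_date(date, "month")) |>
  summarise(count = n())

plot_ly(crash_monthly, x = ~date, y = ~count, type = "scatter", mode = "lines") |>
  layout(title = "Monthly number of injury crashes",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Count of crashes"))
```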
Stop Duration by Demographic Information
When a police officer stops a person, does ethnicity or age affect the duration of the stop? Perhaps some younger people are afraid of getting a penalty ticket, or African Americans are stopped for longer while trying to correct the situation. To examine these claims, interactive box plots are created.
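Such a box plot can be sketched with plotly as follows; the duration column name `duration_minutes` is an assumption for illustration (toy data shown).

```r
library(plotly)

# Toy data with a hypothetical `duration_minutes` column.
d <- data.frame(
  perceived_race_ethnicity = c("Asian", "Asian", "White", "White"),
  duration_minutes = c(5, 60, 12, 15)
)

# One box per ethnicity; outliers appear as individual points.
plot_ly(d, x = ~perceived_race_ethnicity, y = ~duration_minutes, type = "box") |>
  layout(yaxis = list(title = "Duration (minutes)"))
```

Swapping the x aesthetic to the age-group factor gives the second plot discussed below.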
The plot above shows that African American people have the longest stop durations, while the Asian groups (Asian and Middle Eastern or South Asian) show an odd pattern: although their average duration is shorter than that of many other ethnicities, they include many outliers at high minutes.
After ethnicity, the age groups are also compared by duration. As expected, the 60-or-over group has shorter durations than the other age groups, while the under-18 group has the highest average duration and the widest range; the durations of the remaining age groups are almost exactly the same.
Stop Points vs Crash Points
As is widely known, police stop points tend to cluster at the same spots. The Police Stop dataset includes geospatial values for the stop location as longitude and latitude. To show these spots on a map, the longitude and latitude values were grouped and the number of people stopped at each point was counted. Most stop points had just one stop, so to highlight the most popular locations, only points with more than 100 stops were selected. At one location, police officers stopped 2,569 different people.
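The grouping-and-counting step can be sketched as below; the coordinate column names `lon` and `lat` are assumptions (toy data with a threshold of 1 instead of the 100 used on the full data).

```r
library(dplyr)

# Toy stop coordinates: two stops at one point, one at another.
stop <- data.frame(lon = c(-122.4, -122.4, -122.5),
                   lat = c(37.77, 37.77, 37.71))

# Count stops per coordinate pair and keep the busy spots.
df <- stop |>
  group_by(lon, lat) |>
  summarise(cnt = n(), .groups = "drop") |>
  dplyr::filter(cnt > 1)   # the report uses cnt > 100 on the full data
```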
To map the crashes, longitude and latitude first had to be created from the point variable, because it is stored in the awkward form POINT (…). Using string functions from the stringr package, the unnecessary characters were removed, and the variable was then separated into two variables, longitude and latitude.
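A sketch of that POINT (…) parsing with stringr and tidyr (toy values shown; the column name `point` follows the text):

```r
library(stringr)
library(tidyr)
library(dplyr)

# Toy crash locations in the raw "POINT (lon lat)" format.
crash <- data.frame(point = c("POINT (-122.41 37.77)",
                              "POINT (-122.45 37.75)"))

crash <- crash |>
  # Strip the "POINT (" prefix and the closing ")".
  mutate(point = str_remove_all(point, "POINT \\(|\\)")) |>
  # Split on the space and convert both parts to numeric.
  separate(point, into = c("lon", "lat"), sep = " ", convert = TRUE)
```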
summary(df$cnt)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 1.00 3.00 13.62 10.00 2542.00
lon lat cnt
Min. :-122.5 Min. :37.71 Min. : 1.000
1st Qu.:-122.4 1st Qu.:37.75 1st Qu.: 1.000
Median :-122.4 Median :37.77 Median : 1.000
Mean :-122.4 Mean :37.76 Mean : 1.837
3rd Qu.:-122.4 3rd Qu.:37.78 3rd Qu.: 1.000
Max. :-122.4 Max. :37.83 Max. :197.000
NA's :1 NA's :1
Mapping the longitude and latitude values gives an unsurprising result: the most popular area both for crashes and for police stops is the upper east side of the city. Most crashes occur there, and perhaps police officers also make most of their stops there to prevent crashes.
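The map itself can be sketched with ggmap as below. This assumes a Stadia Maps API key has been registered via `register_stadiamaps()` (without one, the tile download fails), and that `df` holds the grouped stop points with `lon`, `lat`, and `cnt` columns as above; the bounding box values are approximate.

```r
library(ggmap)
library(ggplot2)

# Approximate bounding box covering San Francisco.
sf_bbox <- c(left = -122.52, bottom = 37.70, right = -122.35, top = 37.84)
sf_map <- get_stadiamap(bbox = sf_bbox, zoom = 12)

# Overlay the grouped stop points, sized by stop count.
ggmap(sf_map) +
  geom_point(data = df, aes(x = lon, y = lat, size = cnt),
             alpha = 0.4, colour = "red")
```

The same overlay with the crash coordinates produces the companion crash map.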
Districts and Actions
San Francisco has many districts, but this dataset contains more district levels than the actual number of districts, because some district names are written in lower case, others in upper case, and some as N/A. First, the N/A values were dropped, and the rest were converted to upper case.
After these steps, the district values were grouped, counted, and visualized with a word cloud: most stops occurred in the Southern, Mission, and Central districts of San Francisco.
The actions taken during a stop are also investigated in this project, but this variable too is recorded in an awkward way: the officer records the different actions taken for one individual in a single string, separated by “|”. The separate_longer_delim() function splits these strings into separate rows, but it leaves white space at the start and end of each value, which is removed with str_trim().
Each action taken thus becomes its own level, also visualized as a word cloud; the most common action is “Patrol car detention”.
area <- dplyr::select(stop, c("city", "district", "unique_identifier", "actions_taken"))
area$city <- as.factor(area$city)
area$district <- as.factor(area$district)
head(area)
# A tibble: 6 × 4
# Groups: unique_identifier [6]
city district unique_identifier actions_taken
<fct> <fct> <chr> <chr>
1 SAN FRANCISCO TARAVAL U380119326E22D6100AC_5 Search of person was con…
2 SAN FRANCISCO OUT OF SF / UNK U3801223400D4887116B_1 Property was seized | Ha…
3 SAN FRANCISCO TENDERLOIN U380121054B6A404E1D6_1 Search of person was con…
4 SAN FRANCISCO BAYVIEW U380118193C7D32D472B_1 Search of person was con…
5 SAN FRANCISCO BAYVIEW U38012029453C383DF7A_1 Search of person was con…
6 SAN FRANCISCO OUT OF SF / UNK U3801210484E3C91B19E_1 Search of property was c…
sum(is.na(area$actions_taken))
[1] 0
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()

long <- area %>% separate_longer_delim(actions_taken, "|")
long$actions_taken <- str_trim(long$actions_taken, side = "both")  # drop leading/trailing white space
long$actions_taken <- as.factor(long$actions_taken)
#levels(long$actions_taken)
#levels(long$district)
long$district <- toupper(long$district)
long$district <- as.factor(long$district)
long1 <- subset(long, district != "#N/A")
#levels(long1$district)

df3 <- long |>
  group_by(district, actions_taken) |>
  summarise(count = n()) |>
  arrange(desc(count))
df4 <- subset(df3, actions_taken != "None")
#install.packages("wordcloud")
df3 <- df3 |>
  group_by(district) |>
  summarise(count = sum(count))
df4 <- df4 |>
  group_by(actions_taken) |>
  summarise(count = sum(count))

df3 %>%
  with(wordcloud(district, count, max.words = 30, random.order = FALSE,
                 rot.per = 0.35, colors = brewer.pal(8, "Dark2")))